Robotics 39
★ DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes
LiDAR scene generation has been developing rapidly in recent years. However,
existing methods primarily focus on generating static, single-frame scenes,
overlooking the inherently dynamic nature of real-world driving environments.
In this work, we introduce DynamicCity, a novel 4D LiDAR generation framework
capable of generating large-scale, high-quality LiDAR scenes that capture the
temporal evolution of dynamic environments. DynamicCity mainly consists of two
key models. 1) A VAE model for learning HexPlane as the compact 4D
representation. Instead of using naive averaging operations, DynamicCity
employs a novel Projection Module to effectively compress 4D LiDAR features
into six 2D feature maps for HexPlane construction, which significantly
enhances HexPlane fitting quality (up to 12.56 mIoU gain). Furthermore, we
utilize an Expansion & Squeeze Strategy to reconstruct 3D feature volumes in
parallel, which improves both network training efficiency and reconstruction
accuracy compared with naively querying each 3D point (up to 7.05 mIoU gain, 2.06x
training speedup, and 70.84% memory reduction). 2) A DiT-based diffusion model
for HexPlane generation. To make HexPlane feasible for DiT generation, a Padded
Rollout Operation is proposed to reorganize all six feature planes of the
HexPlane as a squared 2D feature map. In particular, various conditions can
be introduced into the diffusion or sampling process, supporting versatile 4D
generation applications, such as trajectory- and command-driven generation,
inpainting, and layout-conditioned generation. Extensive experiments on the
CarlaSC and Waymo datasets demonstrate that DynamicCity significantly
outperforms existing state-of-the-art 4D LiDAR generation methods across
multiple metrics. The code will be released to facilitate future research.
comment: Preprint; 29 pages, 15 figures, 7 tables; Project Page at
https://dynamic-city.github.io/
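To make the Padded Rollout idea concrete, here is a minimal NumPy sketch of packing six HexPlane feature planes into a single zero-padded square map. The grid sizes, channel count, and two-column layout are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical HexPlane sizes for a (X, Y, Z, T) grid with C channels.
X, Y, Z, T, C = 32, 32, 16, 8, 4
planes = {
    "xy": np.random.randn(C, X, Y),
    "xz": np.random.randn(C, X, Z),
    "yz": np.random.randn(C, Y, Z),
    "xt": np.random.randn(C, X, T),
    "yt": np.random.randn(C, Y, T),
    "zt": np.random.randn(C, Z, T),
}

def padded_rollout(planes):
    """Pack all six planes into a single zero-padded square feature map so a
    2D diffusion transformer can attend over them jointly."""
    left = ["xy", "xz", "yz"]    # spatial planes, stacked vertically
    right = ["xt", "yt", "zt"]   # temporal planes, stacked vertically
    left_h = sum(planes[k].shape[1] for k in left)
    right_h = sum(planes[k].shape[1] for k in right)
    left_w = max(planes[k].shape[2] for k in left)
    right_w = max(planes[k].shape[2] for k in right)
    side = max(left_h, right_h, left_w + right_w)
    canvas = np.zeros((C, side, side))
    r = 0
    for k in left:
        p = planes[k]
        canvas[:, r:r + p.shape[1], :p.shape[2]] = p
        r += p.shape[1]
    r = 0
    for k in right:
        p = planes[k]
        canvas[:, r:r + p.shape[1], left_w:left_w + p.shape[2]] = p
        r += p.shape[1]
    return canvas

print(padded_rollout(planes).shape)  # (4, 96, 96) for the sizes above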
★ SPIRE: Synergistic Planning, Imitation, and Reinforcement Learning for Long-Horizon Manipulation
Robot learning has proven to be a general and effective technique for
programming manipulators. Imitation learning can teach robots solely from
human demonstrations but is bottlenecked by the capabilities shown in those
demonstrations. Reinforcement learning uses exploration to discover better
behaviors; however, the space of possible improvements can be too large to
search from scratch. For both techniques, the learning difficulty increases
in proportion to the length of the manipulation task. Accounting for this, we
propose SPIRE, a system that first uses Task and Motion Planning (TAMP) to
decompose tasks into smaller learning subproblems and second combines imitation
and reinforcement learning to maximize their strengths. We develop novel
strategies to train learning agents when deployed in the context of a planning
system. We evaluate SPIRE on a suite of long-horizon and contact-rich robot
manipulation problems. We find that SPIRE outperforms prior approaches that
integrate imitation learning, reinforcement learning, and planning by 35% to
50% in average task performance, is 6 times more data efficient in the number
of human demonstrations needed to train proficient agents, and learns to
complete tasks nearly twice as efficiently. View
https://sites.google.com/view/spire-corl-2024 for more details.
comment: Conference on Robot Learning (CoRL) 2024
★ A Pipeline for Segmenting and Structuring RGB-D Data for Robotics Applications
We introduce a novel pipeline for segmenting and structuring color and depth
(RGB-D) data. Existing processing pipelines for RGB-D data have focused on
extracting geometric information alone. This approach precludes the development
of more advanced robotic navigation and manipulation algorithms, which benefit
from a semantic understanding of their environment. Our pipeline can segment
RGB-D data into accurate semantic masks. These masks are then used to fuse raw
captured point clouds into semantically separated point clouds. We store this
information using the Universal Scene Description (USD) file format, a format
suitable for easy querying by downstream robotics algorithms, human-friendly
visualization, and robotics simulation.
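A minimal sketch of the mask-to-cloud step described above: splitting a fused point cloud into per-class clouds using an aligned semantic mask. Array shapes and class names are assumptions; writing each cloud to a USD prim is left out.

```python
import numpy as np

def split_cloud_by_mask(points_xyz, mask, labels):
    """Split a fused point cloud into per-class clouds via a semantic mask.

    points_xyz: (H*W, 3) back-projected points, row-major over the image.
    mask:       (H, W) integer semantic mask aligned with the depth image.
    labels:     dict mapping class id -> class name.
    """
    flat = mask.reshape(-1)
    return {
        name: points_xyz[flat == cid]
        for cid, name in labels.items()
        if np.any(flat == cid)
    }

# Toy example: a 4x4 image whose left half is "table", right half "floor".
H, W = 4, 4
pts = np.random.rand(H * W, 3)
mask = np.zeros((H, W), dtype=int)
mask[:, W // 2:] = 1
clouds = split_cloud_by_mask(pts, mask, {0: "table", 1: "floor"})
for name, cloud in clouds.items():
    print(name, cloud.shape)  # each cloud could become one USD Points prim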
★ Robust Two-View Geometry Estimation with Implicit Differentiation IROS 2024
We present a novel two-view geometry estimation framework which is based on a
differentiable robust loss function fitting. We propose to treat the robust
fundamental matrix estimation as an implicit layer, which allows us to avoid
backpropagation through time and significantly improves the numerical
stability. To take full advantage of the information from the feature matching
stage we incorporate learnable weights that depend on the matching confidences.
In this way our solution brings together feature extraction, matching and
two-view geometry estimation in a unified end-to-end trainable pipeline. We
evaluate our approach on the camera pose estimation task in both outdoor and
indoor scenarios. The experiments on several datasets show that the proposed
method outperforms both classic and learning-based state-of-the-art methods by
a large margin. The project webpage is available at:
https://github.com/VladPyatov/ihls
comment: IROS 2024 Accepted
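For intuition, the sketch below casts robust fundamental-matrix fitting as an iteratively reweighted least-squares (IRLS) loop with confidence-scaled weights, the kind of inner problem the paper wraps in an implicit layer. The Cauchy kernel and the omission of coordinate normalization are simplifications, and the implicit-differentiation machinery itself is not shown.

```python
import numpy as np

def weighted_eight_point(x1, x2, w):
    """Weighted eight-point estimate of the fundamental matrix F
    (coordinate normalization omitted for brevity).
    x1, x2: (N, 2) matched points; w: (N,) per-match weights."""
    u1, v1 = x1[:, 0], x1[:, 1]
    u2, v2 = x2[:, 0], x2[:, 1]
    A = np.stack([u2 * u1, u2 * v1, u2, v2 * u1, v2 * v1, v2,
                  u1, v1, np.ones_like(u1)], axis=1) * w[:, None]
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)          # enforce the rank-2 constraint
    return U @ np.diag([S[0], S[1], 0.0]) @ Vt

def irls_fundamental(x1, x2, conf, iters=10, c=1.0):
    """Reweight epipolar residuals with a robust (Cauchy) kernel, scaled by
    matching confidences, and refit until the estimate settles."""
    hom = lambda x: np.hstack([x, np.ones((len(x), 1))])
    w = conf.copy()
    for _ in range(iters):
        F = weighted_eight_point(x1, x2, w)
        r = np.abs(np.sum((hom(x2) @ F) * hom(x1), axis=1))  # x2^T F x1
        w = conf / (1.0 + (r / c) ** 2)
    return F

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 640, (100, 2))
x2 = x1 + rng.normal(0, 1.0, (100, 2))       # noisy toy correspondences
F = irls_fundamental(x1, x2, conf=np.ones(100))
print(np.round(F, 4))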
★ Reconfigurable Hydrostatics: Toward Multifunctional and Powerful Wearable Robotics
Wearable and locomotive robot designers face multiple challenges when
choosing actuation. Traditional fully actuated designs using electric motors
are multifunctional, but they are oversized and inefficient when bearing
conservative loads and when required to be backdrivable. Alternatively, quasi-passive and
underactuated designs reduce the size of motorization and energy storage, but
are often designed for specific tasks. Designers of versatile and stronger
wearable robots will face these challenges unless future actuators become very
torque-dense, backdrivable and efficient.
This paper explores a design paradigm for addressing this issue:
reconfigurable hydrostatics. We show that a hydrostatic actuator can integrate
a passive force mechanism and a sharing mechanism in the fluid domain and still
be multifunctional. First, an analytical study compares how these two
mechanisms can relax the motorization requirements in the context of a
load-bearing exoskeleton. Then, the hydrostatic concept integrating these two
mechanisms using hydraulic components is presented. A case study analysis shows
the mass/efficiency/inertia benefits of the concept over a fully actuated one.
Then, the feasibility of the concept is partially validated with a
proof-of-concept that actuates the knees of an exoskeleton. The experiments
show that it can track the vertical ground reaction force (GRF) profiles of
walking, running, squatting, and jumping, and that energy consumption is 6x
lower. The transient force behaviors caused by switching from one leg to the other
are also analyzed, along with mitigation strategies to improve them.
★ Gaussian Process Distance Fields Obstacle and Ground Constraints for Safe Navigation
Navigating cluttered environments is a challenging task for any mobile
system. Existing approaches for ground-based mobile systems primarily focus on
small wheeled robots, which face minimal constraints with overhanging obstacles
and cannot manage steps or stairs, making the problem effectively 2D. However,
navigation for legged robots (or even humans) has to consider an extra
dimension. This paper proposes a tailored scene representation coupled with an
advanced trajectory optimisation algorithm to enable safe navigation. Our 3D
navigation approach is suitable for any ground-based mobile robot, whether
wheeled or legged, as well as for human assistance. Given a 3D point cloud of
the scene and the segmentation of the ground and non-ground points, we
formulate two Gaussian Process distance fields to ensure a collision-free path
and to maintain distance-to-ground constraints. Our method adeptly handles
uneven terrain, steps, and overhanging objects through an innovative use of a
quadtree structure, constructing a multi-resolution map of the free space and
its connectivity graph based on a 2D projection of the relevant scene.
Evaluations with both synthetic and real-world datasets demonstrate that this
approach provides safe and smooth paths, accommodating a wide range of
ground-based mobile systems.
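The sketch below illustrates one way to build a Gaussian Process distance field from surface samples and query it against obstacle and ground constraints. The exponential kernel, scale parameters, and thresholds are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def gp_distance_field(surface_pts, lam=5.0, noise=1e-4):
    """Build a GP distance field from surface samples: regress the field
    v(x) = exp(-lam * d(x)), which equals 1 at surface points, with an
    exponential kernel, then recover d = -(1/lam) * log(v). The posterior
    mean approximately tracks the distance to the nearest surface point."""
    def k(a, b):
        d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
        return np.exp(-lam * d)
    K = k(surface_pts, surface_pts) + noise * np.eye(len(surface_pts))
    alpha = np.linalg.solve(K, np.ones(len(surface_pts)))
    def distance(q):
        v = np.clip(k(q, surface_pts) @ alpha, 1e-9, None)
        return -np.log(v) / lam
    return distance

# Two fields as in the paper: one from obstacle points, one from ground points.
obstacles = np.array([[1.0, 0.0, 0.5], [1.0, 0.2, 0.7]])
ground = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [1.0, 0.0, 0.0]])
d_obs, d_gnd = gp_distance_field(obstacles), gp_distance_field(ground)
q = np.array([[0.5, 0.0, 0.8]])   # candidate robot position
safe = d_obs(q) > 0.3             # stay clear of obstacles
grounded = d_gnd(q) < 1.0         # stay close enough to walkable ground
print(d_obs(q), d_gnd(q), safe, grounded)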
★ Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models
Nils Blank, Moritz Reuss, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Wenzel, Oier Mees, Rudolf Lioutikov
A central challenge towards developing robots that can relate human language
to their perception and actions is the scarcity of natural language annotations
in diverse robot datasets. Moreover, robot policies that follow natural
language instructions are typically trained on either templated language or
expensive human-labeled instructions, hindering their scalability. To this end,
we introduce NILS: Natural language Instruction Labeling for Scalability. NILS
automatically labels uncurated, long-horizon robot data at scale in a zero-shot
manner without any human intervention. NILS combines pretrained vision-language
foundation models to detect objects in a scene, detect object-centric
changes, segment tasks from large datasets of unlabelled interaction data, and
ultimately label behavior datasets. Evaluations on BridgeV2, Fractal, and a
kitchen play dataset show that NILS can autonomously annotate diverse robot
demonstrations of unlabeled and unstructured datasets while alleviating several
shortcomings of crowdsourced human annotations, such as low data quality and
diversity. We use NILS to label over 115k trajectories obtained from over 430
hours of robot data. We open-source our auto-labeling code and generated
annotations on our website: http://robottasklabeling.github.io.
comment: Project Website at https://robottasklabeling.github.io/
★ Multi-Layered Safety of Redundant Robot Manipulators via Task-Oriented Planning and Control
Ensuring safety is crucial for promoting the application of robot manipulators
in open workspaces. Factors such as sensor errors or unpredictable collisions
make the environment full of uncertainties. In this work, we investigate these
potential safety challenges on redundant robot manipulators, and propose a
task-oriented planning and control framework to achieve multi-layered safety
while maintaining efficient task execution. Our approach consists of two main
parts: a task-oriented trajectory planner based on a multiple-shooting model
predictive control method, and a torque controller that allows safe and
efficient collision reaction using only proprioceptive data. Through extensive
simulations and real-hardware experiments, we demonstrate that the proposed
framework can effectively handle uncertain static or dynamic obstacles, and
resist disturbances in manipulation tasks when unforeseen contacts
occur. All code will be open-sourced to benefit the community.
comment: 7 pages, 8 figures. This work has been submitted to the IEEE for
possible publication
★ Towards Safer Planetary Exploration: A Hybrid Architecture for Terrain Traversability Analysis in Mars Rovers
The field of autonomous navigation for unmanned ground vehicles (UGVs) is
growing continuously, and increasing levels of autonomy have been reached in the
last few years. However, the task becomes more challenging when the focus is on
the exploration of planetary surfaces such as Mars. In those situations, UGVs are
forced to navigate unstable and rugged terrain, which inevitably exposes
the vehicle to more hazards, accidents and, in extreme cases, complete mission
failure. The paper addresses the challenges of autonomous navigation for
unmanned ground vehicles in planetary exploration, particularly on Mars,
introducing a hybrid architecture for terrain traversability analysis that
combines two approaches: appearance-based and geometry-based. The
appearance-based method uses semantic segmentation via deep neural networks to
classify different terrain types. This is further refined by pixel-level
terrain roughness classification obtained from the same RGB image, assigning
different costs based on the physical properties of the soil. The
geometry-based method complements the appearance-based approach by evaluating
the terrain's geometrical features, identifying hazards that may not be
detectable by the appearance-based side. The outputs of both methods are
combined into a comprehensive hybrid cost map. The proposed architecture was
trained on synthetic datasets and developed as a ROS2 application to integrate
into broader autonomous navigation systems for harsh environments. Simulations
performed in Unity show the method's ability to perform traversability
analysis online.
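A toy sketch of the fusion step: combining an appearance-based cost (class costs modulated by roughness) with a geometry-based hazard score into one hybrid cost map. The class costs, weights, and lethal thresholds are invented placeholders, not the paper's values.

```python
import numpy as np

# Hypothetical per-class traversal costs; the paper derives costs from soil
# physical properties, and these classes/values are placeholders.
CLASS_COST = {0: 0.1, 1: 0.4, 2: 0.8, 3: 1.0}   # e.g. bedrock .. deep sand

def hybrid_cost_map(semantic_map, roughness_map, geometric_hazard_map,
                    w_app=0.5, w_geo=0.5):
    """Fuse appearance-based and geometry-based traversability into one map.
    semantic_map:         (H, W) terrain-class ids from the segmentation net.
    roughness_map:        (H, W) per-pixel roughness in [0, 1], same RGB image.
    geometric_hazard_map: (H, W) hazard score in [0, 1] from 3D geometry.
    Cells judged lethal by either branch are marked untraversable (inf)."""
    appearance = np.vectorize(CLASS_COST.get)(semantic_map) * (1.0 + roughness_map)
    cost = w_app * appearance + w_geo * geometric_hazard_map
    cost[(geometric_hazard_map >= 1.0) | (appearance >= 2.0)] = np.inf
    return cost

sem = np.random.randint(0, 4, (8, 8))
rough = np.random.rand(8, 8)
geo = np.random.rand(8, 8)
print(hybrid_cost_map(sem, rough, geo))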
★ Markov Potential Game with Final-time Reach-Avoid Objectives
We formulate a Markov potential game with final-time reach-avoid objectives
by integrating potential game theory with stochastic reach-avoid control. Our
focus is on multi-player trajectory planning where players maximize the same
multi-player reach-avoid objective: the probability of all participants
reaching their designated target states by a specified time, while avoiding
collisions with one another. Existing approaches require centralized
computation of actions via a global policy, which may have prohibitively
expensive communication costs. Instead, we focus on approximations of the
global policy via local state feedback policies. First, we adapt the recursive
single player reach-avoid value iteration to the multi-player framework with
local policies, and show that the same recursion holds on the joint state
space. To find each player's optimal local policy, the multi-player reach-avoid
value function is projected from the joint state to the local state using the
other players' occupancy measures. Then, we propose an iterative best response
scheme for the multi-player value iteration to converge to a pure Nash
equilibrium. We demonstrate the utility of our approach in finding
collision-free policies for multi-player motion planning in simulation.
comment: 8 pages, 2 figures
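For intuition, here is a single-agent version of the final-time reach-avoid value iteration on a finite MDP; the paper runs an analogous recursion on the joint multi-player state with local feedback policies. The corridor example is illustrative.

```python
import numpy as np

def reach_avoid_value_iteration(P, target, avoid, T):
    """Final-time reach-avoid value iteration on a finite MDP.
    P:      (A, S, S) transition probabilities P[a, s, s'].
    target: (S,) bool, states to occupy at the final time T.
    avoid:  (S,) bool, states that must never be visited.
    Returns V[t, s] = max probability of being in `target` at time T while
    staying out of `avoid`, plus the greedy policy at each step."""
    A, S, _ = P.shape
    safe = ~avoid
    V = np.zeros((T + 1, S))
    V[T] = target.astype(float) * safe
    pi = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):
        Q = P @ V[t + 1]          # (A, S): expected next-step value
        pi[t] = Q.argmax(axis=0)
        V[t] = safe * Q.max(axis=0)
    return V, pi

# Toy 1D corridor: 5 cells, actions {left, stay, right}, cell 2 is unsafe,
# cell 4 is the target.
S, A, T = 5, 3, 6
P = np.zeros((A, S, S))
for s in range(S):
    P[0, s, max(s - 1, 0)] = 1.0
    P[1, s, s] = 1.0
    P[2, s, min(s + 1, S - 1)] = 1.0
target = np.array([False, False, False, False, True])
avoid = np.array([False, False, True, False, False])
V, pi = reach_avoid_value_iteration(P, target, avoid, T)
print(V[0])  # success probability from each start cell (0 left of the hazard)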
★ Human-Robot Collaboration System Setup for Weed Harvesting Scenarios in Aquatic Lakes IROS 2024
Artificial Water Bodies (AWBs) are human-made and require continuous
monitoring due to their artificial biological processes. These systems
necessitate regular maintenance to manage their ecosystems effectively.
An Unmanned Surface Vehicle (USV) offers a collaborative approach for monitoring
these environments, working alongside human operators such as boat skippers to
identify specific locations. This paper discusses a weed harvesting scenario,
demonstrating how human-robot collaboration can be achieved, supported by
preliminary results. The USV mainly utilises multibeam SOund NAvigation and
Ranging (SONAR) for underwater weed monitoring, showing promising outcomes in
these scenarios.
comment: 3 pages, 5 figures. This paper was accepted for poster presentation
at IROS 2024 Workshop on Maritime Heterogeneous Unmanned Robotic Systems
(MHURS)
★ Incremental Learning of Affordances using Markov Logic Networks
Affordances enable robots to have a semantic understanding of their
surroundings, giving them greater flexibility when completing a given task.
Capturing object affordances in a machine learning model is difficult because
of their dependence on contextual information. Markov Logic Networks (MLNs)
combine probabilistic reasoning with logic and are able to capture such
context. Mobile robots operate in partially known environments in which
previously unseen object affordances can be encountered. This new information
must be incorporated into the existing knowledge without retraining the MLN
from scratch. We introduce the MLN Cumulative Learning Algorithm (MLN-CLA).
MLN-CLA learns new relations in various knowledge domains by retaining
knowledge and only updating the changed knowledge, for which the MLN is
retrained. We show that MLN-CLA is effective for cumulative learning and
zero-shot affordance inference, outperforming strong baselines.
comment: accepted at IEEE IRC 2024
★ ImDy: Human Inverse Dynamics from Imitated Observations
Inverse dynamics (ID), which aims at reproducing the driving torques from
human kinematic observations, has been a critical tool for gait analysis.
However, it is hindered from wider application to general motion due to its
limited scalability. Conventional optimization-based ID requires expensive
laboratory setups, restricting its availability. To alleviate this problem, we
propose to exploit recent progress in human motion imitation algorithms
to learn human inverse dynamics in a data-driven manner. The key insight is
that the human ID knowledge is implicitly possessed by motion imitators, though
not directly applicable. In light of this, we devise an efficient data
collection pipeline with state-of-the-art motion imitation algorithms and
physics simulators, resulting in a large-scale human inverse dynamics benchmark,
Imitated Dynamics (ImDy). ImDy contains over 150 hours of motion with joint
torque and full-body ground reaction force data. With ImDy, we train a
data-driven human inverse dynamics solver ImDyS(olver) in a fully supervised
manner, which conducts ID and ground reaction force estimation simultaneously.
Experiments on ImDy and real-world data demonstrate the impressive competency
of ImDyS in human inverse dynamics and ground reaction force estimation.
Moreover, the potential of ImDy(-S) as a fundamental motion analysis tool is
exhibited with downstream applications. The project page is
https://foruck.github.io/ImDy/.
comment: Yong-Lu Li and Cewu Lu are the corresponding authors
★ Integrating Large Language Models for UAV Control in Simulated Environments: A Modular Interaction Approach
The intersection of LLM (Large Language Model) and UAV (Unoccupied Aerial
Vehicle) technology represents a promising field of research with the
potential to enhance UAV capabilities significantly. This study explores the
application of LLMs in UAV control, focusing on the opportunities for
integrating advanced natural language processing into autonomous aerial
systems. By enabling UAVs to interpret and respond to natural language
commands, LLMs simplify UAV control and usage, making them accessible to a
broader user base and facilitating more intuitive human-machine interactions.
The paper discusses several key areas where LLMs can impact UAV technology,
including autonomous decision-making, dynamic mission planning, enhanced
situational awareness, and improved safety protocols. Through a comprehensive
review of current developments and potential future directions, this study aims
to highlight how LLMs can transform UAV operations, making them more adaptable,
responsive, and efficient in complex environments. A template development
framework for integrating LLMs in UAV control is also described. Proof of
Concept results that integrate existing LLM models and popular robotic
simulation platforms are demonstrated. The findings suggest that while there
are substantial technical and ethical challenges to address, integrating LLMs
into UAV control holds promising implications for advancing autonomous aerial
systems.
★ Energy-Optimal Planning of Waypoint-Based UAV Missions -- Does Minimum Distance Mean Minimum Energy? IROS
Multirotor unmanned aerial vehicles are a prevailing type of aerial robot with
wide real-world applications. The energy efficiency of the robot is a critical
aspect of its performance, determining the range and duration of the missions
that can be performed. This paper studies the energy-optimal planning of the
multirotor, which aims at finding the optimal ordering of waypoints with the
minimum energy consumption for missions in 3D space. The study is performed
based on a previously developed model capturing first-principle energy dynamics
of the multirotor. We found that in the majority of cases (up to 95%) the
solutions of the energy-optimal planning are different from those of the
traditional traveling salesman problem which minimizes the total distance. The
difference can be as high as 14.9%, with the average at 1.6%-3.3% and 90th
percentile at 3.7%-6.5% depending on the range and number of waypoints in the
mission. We then identified and explained the key features of the
minimum-energy order by correlating to the underlying flight energy dynamics.
It is shown that instead of minimizing the distance, coordination of vertical
and horizontal motion to promote aerodynamic efficiency is the key to
optimizing energy consumption.
comment: This paper has been accepted for presentation at the IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS) 2024
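The sketch below illustrates the core comparison: ordering the same waypoints under a distance cost versus an energy-like cost. The energy surrogate (climbing costs more than descending) is a toy stand-in for the paper's first-principle multirotor energy model.

```python
import numpy as np
from itertools import permutations

def pairwise_cost(waypoints, cost_fn):
    n = len(waypoints)
    return np.array([[cost_fn(waypoints[i], waypoints[j]) for j in range(n)]
                     for i in range(n)])

def best_order(C):
    """Brute-force open-tour ordering starting at waypoint 0 (ok for small n)."""
    n = len(C)
    best, best_cost = None, np.inf
    for perm in permutations(range(1, n)):
        order = (0,) + perm
        cost = sum(C[order[k], order[k + 1]] for k in range(n - 1))
        if cost < best_cost:
            best, best_cost = order, cost
    return best, best_cost

dist = lambda a, b: np.linalg.norm(a - b)

def energy(a, b):
    """Toy surrogate: climbing costs more than descending, horizontal travel
    has a baseline cost. Only meant to show why the minimum-energy order can
    differ from the minimum-distance one."""
    dz = b[2] - a[2]
    horiz = np.linalg.norm(b[:2] - a[:2])
    return horiz + 3.0 * max(dz, 0.0) + 0.5 * max(-dz, 0.0)

wps = np.random.default_rng(1).uniform(0, 100, (7, 3))
o_dist, _ = best_order(pairwise_cost(wps, dist))
o_energy, _ = best_order(pairwise_cost(wps, energy))
print("min-distance order:", o_dist)
print("min-energy order:  ", o_energy)   # often a different sequence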
★ Real-time Vehicle-to-Vehicle Communication Based Network Cooperative Control System through Distributed Database and Multimodal Perception: Demonstrated in Crossroads
The autonomous driving industry is rapidly advancing, with Vehicle-to-Vehicle
(V2V) communication systems emerging as a key component of enhanced road
safety and traffic efficiency. This paper introduces a novel Real-time
Vehicle-to-Vehicle Communication Based Network Cooperative Control System
(VVCCS), designed to revolutionize macro-scope traffic planning and collision
avoidance in autonomous driving. Implemented on the Quanser Car (QCar) hardware
platform, our system integrates distributed databases into individual
autonomous vehicles and an optional central server. We also developed a
comprehensive multi-modal perception system with multi-object tracking and
radar sensing. Through a demonstration within a physical crossroad environment,
our system showcases its potential to be applied in congested and complex urban
environments.
comment: ICICT 2024, 18 pages
★ Multimodal Information Bottleneck for Deep Reinforcement Learning with Multiple Sensors
Reinforcement learning has achieved promising results on robotic control
tasks but struggles to leverage information effectively from multiple sensory
modalities that differ in many characteristics. Recent works construct
auxiliary losses based on reconstruction or mutual information to extract joint
representations from multiple sensory inputs to improve the sample efficiency
and performance of reinforcement learning algorithms. However, the
representations learned by these methods could capture information irrelevant
to learning a policy and may degrade the performance. We argue that it is
helpful to compress the information that the learned joint representations
retain about the raw multimodal observations, and we propose a multimodal
information bottleneck model
to learn task-relevant joint representations from egocentric images and
proprioception. Our model compresses and retains the predictive information in
multimodal observations for learning a compressed joint representation, which
fuses complementary information from visual and proprioceptive feedback and
meanwhile filters out task-irrelevant information in raw multimodal
observations. We propose to minimize an upper bound of our multimodal
information bottleneck objective for computationally tractable optimization.
Experimental evaluations on several challenging locomotion tasks with
egocentric images and proprioception show that our method achieves better
sample efficiency and zero-shot robustness to unseen white noise than leading
baselines. We also empirically demonstrate that leveraging information from
egocentric images and proprioception is more helpful for learning policies on
locomotion tasks than solely using one single modality.
comment: 31 pages
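A minimal PyTorch sketch of a multimodal information-bottleneck objective in the spirit described above: a stochastic latent fusing image and proprioceptive features, a predictive term that retains task-relevant information, and a KL penalty that compresses away the rest. Network sizes and the prediction target are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalIB(nn.Module):
    """Compress image + proprioception into a stochastic latent z, keep
    predictive information via a next-latent head, and penalize I(z; obs)
    with a KL term to a standard normal prior."""

    def __init__(self, img_dim=256, prop_dim=16, z_dim=32):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Linear(img_dim, 128), nn.ReLU())
        self.prop_enc = nn.Sequential(nn.Linear(prop_dim, 32), nn.ReLU())
        self.to_gauss = nn.Linear(128 + 32, 2 * z_dim)  # mean and log-var
        self.predictor = nn.Linear(z_dim, z_dim)        # predicts next latent

    def encode(self, img_feat, prop):
        h = torch.cat([self.img_enc(img_feat), self.prop_enc(prop)], dim=-1)
        mu, logvar = self.to_gauss(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z, mu, logvar

    def loss(self, img_t, prop_t, img_t1, prop_t1, beta=1e-3):
        z_t, mu, logvar = self.encode(img_t, prop_t)
        with torch.no_grad():                       # target latent, no gradient
            z_t1, _, _ = self.encode(img_t1, prop_t1)
        pred = F.mse_loss(self.predictor(z_t), z_t1)  # retain predictive info
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return pred + beta * kl                       # upper-bound-style loss

model = MultimodalIB()
img, prop = torch.randn(16, 256), torch.randn(16, 16)
img1, prop1 = torch.randn(16, 256), torch.randn(16, 16)
model.loss(img, prop, img1, prop1).backward()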
★ Generalizable Motion Planning via Operator Learning
In this work, we introduce a planning neural operator (PNO) for predicting
the value function of a motion planning problem. We recast value function
approximation as learning a single operator across continuous function spaces,
mapping the cost function space to the value function space, which we prove is
equivalent to solving an Eikonal partial differential equation (PDE). Through
this reformulation, our learned PNO is able to
generalize to new motion planning problems without retraining. Therefore, our
PNO model, despite being trained with a finite number of samples at coarse
resolution, inherits the zero-shot super-resolution property of neural
operators. We demonstrate accurate value function approximation at 16 times the
training resolution on the MovingAI lab's 2D city dataset and compare with
state-of-the-art neural value function predictors on 3D scenes from the iGibson
building dataset. Lastly, we investigate employing the value function output of
PNO as a heuristic function to accelerate motion planning. We show
theoretically that the PNO heuristic is ε-consistent by introducing an
inductive bias layer that guarantees our value functions satisfy the triangle
inequality. With our heuristic, we achieve a 30% decrease in nodes visited
while obtaining near optimal path lengths on the MovingAI lab 2D city dataset,
compared to classical planning methods (A*, RRT*).
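To show how a learned value function can serve as a search heuristic, here is a plain grid A* that accepts any heuristic; the paper would plug in the PNO-predicted value, while Euclidean distance stands in below. Both are intended to satisfy the triangle inequality (ε-consistency), which is what keeps A* from over-expanding.

```python
import heapq
import itertools
import numpy as np

def astar(grid, start, goal, heuristic):
    """Grid A* with 4-connectivity and a pluggable heuristic h(cell, goal).
    Returns the path and the number of nodes expanded."""
    tie = itertools.count()               # heap tiebreaker
    frontier = [(heuristic(start, goal), next(tie), 0.0, start, None)]
    came, g = {}, {start: 0.0}
    H, W = grid.shape
    while frontier:
        _, _, cost, cur, parent = heapq.heappop(frontier)
        if cur in came:
            continue
        came[cur] = parent
        if cur == goal:
            path = [cur]
            while came[path[-1]] is not None:
                path.append(came[path[-1]])
            return path[::-1], len(came)
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dx, cur[1] + dy)
            if 0 <= nxt[0] < H and 0 <= nxt[1] < W and grid[nxt] == 0:
                ng = cost + 1.0
                if ng < g.get(nxt, np.inf):
                    g[nxt] = ng
                    heapq.heappush(
                        frontier,
                        (ng + heuristic(nxt, goal), next(tie), ng, nxt, cur))
    return None, len(came)

euclid = lambda a, b: float(np.hypot(a[0] - b[0], a[1] - b[1]))
grid = np.zeros((20, 20), dtype=int)
grid[5:15, 10] = 1                         # a wall to route around
path, expanded = astar(grid, (0, 0), (19, 19), euclid)
print(len(path), "steps,", expanded, "nodes expanded")
```

A tighter heuristic (such as a learned value function) reduces the "nodes expanded" count while, if ε-consistent, keeping the path near optimal.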
★ Mechanisms and Computational Design of Multi-Modal End-Effector with Force Sensing using Gated Networks
In limbed robotics, end-effectors must serve dual functions, such as both
feet for locomotion and grippers for grasping, which presents design
challenges. This paper introduces MAGPIE, a multi-modal end-effector capable of
transitioning between flat and line foot configurations while providing
grasping capabilities. MAGPIE integrates 8-axis force sensing using the proposed
mechanisms with Hall effect sensors, enabling both contact and tactile force
measurements. We present a computational design framework for our sensing
mechanism that accounts for noise and interference, allowing for desired
sensitivity and force ranges and generating ideal inverse models. The hardware
implementation of MAGPIE is validated through experiments, demonstrating its
capability as a foot and verifying the performance of the sensing mechanisms,
ideal models, and gated network-based models.
★ X-MOBILITY: End-To-End Generalizable Navigation via World Modeling
General-purpose navigation in challenging environments remains a significant
problem in robotics, with current state-of-the-art approaches facing myriad
limitations. Classical approaches struggle with cluttered settings and require
extensive tuning, while learning-based methods face difficulties generalizing
to out-of-distribution environments. This paper introduces X-Mobility, an
end-to-end generalizable navigation model that overcomes existing challenges by
leveraging three key ideas. First, X-Mobility employs an auto-regressive world
modeling architecture with a latent state space to capture world dynamics.
Second, a diverse set of multi-head decoders enables the model to learn a rich
state representation that correlates strongly with effective navigation skills.
Third, by decoupling world modeling from action policy, our architecture can
train effectively on a variety of data sources, both with and without expert
policies: off-policy data allows the model to learn world dynamics, while
on-policy data with supervisory control enables optimal action policy learning.
Through extensive experiments, we demonstrate that X-Mobility not only
generalizes effectively but also surpasses current state-of-the-art navigation
approaches. Additionally, X-Mobility achieves zero-shot Sim2Real
transferability and shows strong potential for cross-embodiment generalization.
★ GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy
Diffusion-based policies have shown remarkable capability in executing
complex robotic manipulation tasks but lack explicit characterization of
geometry and semantics, which often limits their ability to generalize to
unseen objects and layouts. To enhance the generalization capabilities of
Diffusion Policy, we introduce a novel framework that incorporates explicit
spatial and semantic information via 3D semantic fields. We generate 3D
descriptor fields from multi-view RGBD observations with large foundation
vision models, then compare these descriptor fields against reference
descriptors to obtain semantic fields. The proposed method explicitly considers
geometry and semantics, enabling strong generalization in tasks that require
category-level generalization, resolution of geometric ambiguities, and
attention to subtle geometric details. We evaluate our method across eight
tasks involving articulated objects and instances with varying shapes and
textures from multiple object categories. Our method demonstrates its
effectiveness by increasing Diffusion Policy's average success rate on unseen
instances from 20% to 93%. Additionally, we provide a detailed analysis and
visualization to interpret the sources of performance gain and explain how our
method can generalize to novel instances.
comment: Accepted to Conference on Robot Learning (CoRL 2024). Project Page:
https://robopil.github.io/GenDP/
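A minimal sketch of the descriptor-to-semantic-field comparison: per-point descriptors scored against reference part descriptors by cosine similarity. Shapes and the random inputs are placeholders; the paper distills its descriptor fields from large vision models over multi-view RGBD.

```python
import numpy as np

def semantic_fields(descriptor_field, reference_descriptors):
    """Turn a 3D descriptor field into per-reference semantic fields by
    cosine similarity.
    descriptor_field:      (N, D) one descriptor per 3D point.
    reference_descriptors: (K, D) one per semantic part of interest
                           (e.g. a mug handle, a pair of scissors' blade)."""
    f = descriptor_field / np.linalg.norm(descriptor_field, axis=1, keepdims=True)
    r = reference_descriptors / np.linalg.norm(reference_descriptors, axis=1,
                                               keepdims=True)
    return f @ r.T   # (N, K): per-point similarity to each reference

pts_desc = np.random.randn(1000, 64)   # descriptors for 1000 scene points
refs = np.random.randn(3, 64)          # 3 reference part descriptors
fields = semantic_fields(pts_desc, refs)
print(fields.shape)                    # the policy conditions on these channels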
♻ ★ JointMotion: Joint Self-Supervision for Joint Motion Prediction
We present JointMotion, a self-supervised pre-training method for joint
motion prediction in self-driving vehicles. Our method jointly optimizes a
scene-level objective connecting motion and environments, and an instance-level
objective to refine learned representations. Scene-level representations are
learned via non-contrastive similarity learning of past motion sequences and
environment context. At the instance level, we use masked autoencoding to
refine multimodal polyline representations. We complement this with an adaptive
pre-training decoder that enables JointMotion to generalize across different
environment representations, fusion mechanisms, and dataset characteristics.
Notably, our method reduces the joint final displacement error of Wayformer,
HPTR, and Scene Transformer models by 3%, 8%, and 12%, respectively; and
enables transfer learning between the Waymo Open Motion and the Argoverse 2
Motion Forecasting datasets. Code: https://github.com/kit-mrt/future-motion
comment: CoRL'24 camera-ready
♻ ★ UniSaT: Unified-Objective Belief Model and Planner to Search for and Track Multiple Objects SC
Path planning for autonomous search and tracking of multiple objects is a
critical problem in applications such as reconnaissance, surveillance, and data
gathering. Due to the inherently competing objectives of searching for new
objects while maintaining tracks for found objects, most current approaches
rely on multi-objective planning methods, leaving it up to the user to tune
parameters to balance between the two objectives, usually based on heuristics
or trial and error. In this paper, we introduce UniSaT (Unified Search and
Track), a novel unified-objective formulation for the search and track problem
based on Random Finite Sets (RFS). Our approach models unknown and known
objects using a combined generalized labeled multi-Bernoulli (GLMB) filter. For
unseen objects, UniSaT leverages both cardinality and spatial prior
distributions, allowing it to operate without prior knowledge of the exact
number of objects in the search space. The planner maximizes the mutual
information of this unified belief model, creating balanced search and tracking
behaviors. We demonstrate our work in a simulated environment, presenting both
qualitative results and quantitative improvements over a multi-objective
method.
comment: 13 pages, AIAA SCITECH 2025 Forum
♻ ★ DexGrasp-Diffusion: Diffusion-based Unified Functional Grasp Synthesis Method for Multi-Dexterous Robotic Hands
Zhengshen Zhang, Lei Zhou, Chenchen Liu, Zhiyang Liu, Chengran Yuan, Sheng Guo, Ruiteng Zhao, Marcelo H. Ang Jr., Francis EH Tay
The versatility and adaptability of human grasping catalyze advances in
dexterous robotic manipulation. While significant strides have been made in
dexterous grasp generation, current research pivots towards optimizing
object manipulation while ensuring functional integrity, emphasizing the
synthesis of functional grasps that follow desired affordance instructions. This
paper addresses the challenge of synthesizing functional grasps tailored to
diverse dexterous robotic hands by proposing DexGrasp-Diffusion, an end-to-end
modularized diffusion-based method. DexGrasp-Diffusion integrates
MultiHandDiffuser, a novel unified data-driven diffusion model for
multi-dexterous hands grasp estimation, with DexDiscriminator, which employs a
Physics Discriminator and a Functional Discriminator with an open-vocabulary
setting to filter physically plausible functional grasps based on object
affordances. The experimental evaluation conducted on the MultiDex dataset
provides substantiating evidence supporting the superior performance of
MultiHandDiffuser over the baseline model in terms of success rate, grasp
diversity, and collision depth. Moreover, we demonstrate the capacity of
DexGrasp-Diffusion to reliably generate functional grasps for household objects
aligned with specific affordance instructions.
comment: 15 pages, 5 figures
♻ ★ ODTFormer: Efficient Obstacle Detection and Tracking with Stereo Cameras Based on Transformer IROS 2024
Obstacle detection and tracking represent a critical component in robot
autonomous navigation. In this paper, we propose ODTFormer, a Transformer-based
model to address both obstacle detection and tracking problems. For the
detection task, our approach leverages deformable attention to construct a 3D
cost volume, which is decoded progressively in the form of voxel occupancy
grids. We further track the obstacles by matching the voxels between
consecutive frames. The entire model can be optimized in an end-to-end manner.
Through extensive experiments on DrivingStereo and KITTI benchmarks, our model
achieves state-of-the-art performance in the obstacle detection task. We also
report comparable accuracy to state-of-the-art obstacle tracking models while
requiring only a fraction of their computation cost, typically ten-fold to
twenty-fold less. The code and model weights will be publicly released.
comment: 8 pages. Accepted by IROS 2024
♻ ★ The Art of Imitation: Learning Long-Horizon Manipulation Tasks from Few Demonstrations
Task-Parametrized Gaussian Mixture Models (TP-GMMs) are a sample-efficient
method for learning object-centric robot manipulation tasks. However, there are
several open challenges to applying TP-GMMs in the wild. In this work, we
tackle three crucial challenges synergistically. First, end-effector velocities
are non-Euclidean and thus hard to model using standard GMMs. We thus propose
to factorize the robot's end-effector velocity into its direction and
magnitude, and model them using Riemannian GMMs. Second, we leverage the
factorized velocities to segment and sequence skills from complex demonstration
trajectories. Through the segmentation, we further align skill trajectories and
hence leverage time as a powerful inductive bias. Third, we present a method to
automatically detect relevant task parameters per skill from visual
observations. Our approach enables learning complex manipulation tasks from
just five demonstrations while using only RGB-D observations. Extensive
experimental evaluations on RLBench demonstrate that our approach achieves
state-of-the-art performance with 20-fold improved sample efficiency. Our
policies generalize across different environments, object instances, and object
positions, while the learned skills are reusable.
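The velocity factorization is simple to state in code: split each end-effector velocity into a magnitude on the positive reals and a unit direction on the sphere, which the paper then models with Riemannian GMMs (not shown here).

```python
import numpy as np

def factorize_velocity(v, eps=1e-8):
    """Split end-effector velocities into magnitude and direction: magnitudes
    live on R+, unit directions on the sphere S^2, each easier to model than
    raw non-Euclidean velocities. v: (N, 3) velocity samples."""
    mag = np.linalg.norm(v, axis=1)
    direction = v / np.maximum(mag, eps)[:, None]
    return mag, direction

def recompose(mag, direction):
    return mag[:, None] * direction

v = np.random.randn(100, 3)
m, d = factorize_velocity(v)
assert np.allclose(recompose(m, d), v)   # lossless round trip
print(m.shape, d.shape)                  # (100,), (100, 3), with ||d_i|| = 1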
♻ ★ Flying through Moving Gates without Full State Estimation
Autonomous drone racing requires powerful perception, planning, and control
and has become a benchmark and test field for autonomous, agile flight.
Existing work usually assumes static race tracks with known maps, which enables
offline planning of time-optimal trajectories, performing localization relative
to the gates to reduce drift in visual-inertial odometry (VIO) for state
estimation or training learning-based methods for the particular race track and
operating environment. In contrast, many real-world tasks like disaster
response or delivery need to be performed in unknown and dynamic environments.
To close this gap and make drone racing more robust against unseen environments
and moving gates, we propose a control algorithm that does not require a race
track map or VIO and uses only monocular measurements of the line of sight
(LOS) to the gates. For this purpose, we adopt the law of proportional
navigation (PN) to accurately fly through the gates despite gate motions or
wind. We formulate the PN-informed vision-based control problem for drone
racing as a constrained optimization problem and derive a closed-form optimal
solution. We demonstrate through extensive simulations and real-world
experiments that our method can navigate through moving gates at high speeds
while being robust to different gate movements, model errors, wind, and delays.
comment: 7 pages, 6 figures
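For reference, below is the textbook 2D proportional navigation law the paper builds on, in a toy closed-loop rollout toward a drifting gate. The paper's actual controller embeds PN in a constrained optimization with a closed-form solution; the gains and dynamics here are illustrative.

```python
import numpy as np

def pn_accel_2d(p_r, v_r, p_g, v_g, N=3.0):
    """Textbook 2D proportional navigation: command acceleration
    a = N * Vc * lam_dot, applied perpendicular to the line of sight (LOS)."""
    r = p_g - p_r                                  # LOS vector to the gate
    v_rel = v_g - v_r
    rho = np.linalg.norm(r)
    lam_dot = (r[0] * v_rel[1] - r[1] * v_rel[0]) / rho**2   # LOS rate
    v_c = -(r @ v_rel) / rho                                 # closing speed
    los = r / rho
    return N * v_c * lam_dot * np.array([-los[1], los[0]])   # normal to LOS

# Closed-loop toy rollout toward a gate drifting sideways.
dt = 0.02
p, v = np.array([0.0, 0.0]), np.array([5.0, 0.0])
pg, vg = np.array([20.0, 2.0]), np.array([0.0, 0.5])
for _ in range(400):
    if np.linalg.norm(pg - p) < 0.5:               # close enough: "through"
        break
    v = v + pn_accel_2d(p, v, pg, vg) * dt
    p = p + v * dt
    pg = pg + vg * dt
print("miss distance:", round(float(np.linalg.norm(pg - p)), 3))
```

Note that PN only needs the LOS direction and its rate, which is why a monocular LOS measurement suffices without a map or VIO.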
♻ ★ Cross-Category Functional Grasp Transfer
Generating grasps for a dexterous hand often requires numerous grasping
annotations. However, annotating high-DoF dexterous hand poses is quite
challenging, especially for functional grasps, which require the hand to grasp
the object in a specific pose to facilitate subsequent manipulation. This
prompts us to explore how people manipulate new objects based on past
grasping experience. We find that when grasping new items, people are adept at
discovering and leveraging various similarities between objects, including
shape, layout, and grasp type. Considering this, we analyze and collect
grasp-related similarity relationships among 51 common tool-like object
categories and annotate semantic grasp representation for 1768 objects. These
objects are connected through similarities to form a knowledge graph, which
supports our proposed cross-category functional grasp synthesis. Through
extensive experiments, we demonstrate that this grasp-related knowledge indeed
contributes to achieving functional grasp transfer across unknown or entirely
new categories of objects.
♻ ★ Gaussian-Informed Continuum for Physical Property Identification and Simulation NeurIPS 2024
This paper studies the problem of estimating physical properties (system
identification) through visual observations. To facilitate geometry-aware
guidance in physical property estimation, we introduce a novel hybrid framework
that leverages a 3D Gaussian representation to not only capture explicit shapes
but also enable the simulated continuum to render object masks as 2D shape
surrogates during training.
We propose a new dynamic 3D Gaussian framework based on motion factorization
to recover the object as 3D Gaussian point sets across different time states.
Furthermore, we develop a coarse-to-fine filling strategy to generate the
density fields of the object from the Gaussian reconstruction, allowing for the
extraction of object continuums along with their surfaces and the integration
of Gaussian attributes into these continuums.
In addition to the extracted object surfaces, the Gaussian-informed continuum
also enables the rendering of object masks during simulations, serving as
2D-shape guidance for physical property estimation.
Extensive experimental evaluations demonstrate that our pipeline achieves
state-of-the-art performance across multiple benchmarks and metrics.
Additionally, we illustrate the effectiveness of the proposed method through
real-world demonstrations, showcasing its practical utility.
Our project page is at https://jukgei.github.io/project/gic.
comment: 21 pages, 8 figures, NeurIPS 2024
♻ ★ Diffusion-Reward Adversarial Imitation Learning
Imitation learning aims to learn a policy from observing expert
demonstrations without access to reward signals from environments. Generative
adversarial imitation learning (GAIL) formulates imitation learning as
adversarial learning, employing a generator policy that learns to imitate expert
behaviors and a discriminator that learns to distinguish the expert demonstrations
from agent trajectories. Despite its encouraging results, GAIL training is
often brittle and unstable. Inspired by the recent dominance of diffusion
models in generative modeling, we propose Diffusion-Reward Adversarial
Imitation Learning (DRAIL), which integrates a diffusion model into GAIL,
aiming to yield more robust and smoother rewards for policy learning.
Specifically, we propose a diffusion discriminative classifier to construct an
enhanced discriminator, and design diffusion rewards based on the classifier's
output for policy learning. Extensive experiments are conducted in navigation,
manipulation, and locomotion, verifying DRAIL's effectiveness compared to prior
imitation learning methods. Moreover, additional experimental results
demonstrate the generalizability and data efficiency of DRAIL. Visualized
learned reward functions of GAIL and DRAIL suggest that DRAIL can produce more
robust and smoother rewards. Project page:
https://nturobotlearninglab.github.io/DRAIL/
♻ ★ Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning
Can we endow visuomotor robots with generalization capabilities to operate in
diverse open-world scenarios? In this paper, we propose Maniwhere, a
generalizable framework tailored for visual reinforcement learning, enabling
the trained robot policies to generalize across a combination of multiple
visual disturbance types. Specifically, we introduce a multi-view
representation learning approach fused with a Spatial Transformer Network (STN)
module to capture shared semantic information and correspondences among
different viewpoints. In addition, we employ a curriculum-based randomization
and augmentation approach to stabilize the RL training process and strengthen
the visual generalization ability. To exhibit the effectiveness of Maniwhere,
we meticulously design 8 tasks encompassing articulated objects, bi-manual
manipulation, and dexterous hand manipulation, demonstrating Maniwhere's strong visual
generalization and sim2real transfer abilities across 3 hardware platforms. Our
experiments show that Maniwhere significantly outperforms existing
state-of-the-art methods. Videos are provided at
https://gemcollector.github.io/maniwhere/.
comment: Webpage: https://gemcollector.github.io/maniwhere/
♻ ★ Exploring Self-Supervised Skeleton-Based Human Action Recognition under Occlusions
Yifei Chen, Kunyu Peng, Alina Roitberg, David Schneider, Jiaming Zhang, Junwei Zheng, Ruiping Liu, Yufan Chen, Kailun Yang, Rainer Stiefelhagen
To integrate self-supervised skeleton-based action recognition methods into
autonomous robotic systems, it is crucial to consider adverse situations
involving target occlusions. Such a scenario, despite its practical relevance,
is rarely addressed in existing self-supervised skeleton-based action
recognition methods. To empower models with the capacity to address occlusion,
we propose a simple and effective method. We first pre-train using occluded
skeleton sequences, then use k-means clustering (KMeans) on sequence embeddings
to group semantically similar samples. Next, we propose KNN-Imputation to fill
in missing skeleton data based on the closest sample neighbors. Imputing
incomplete skeleton sequences to create relatively complete sequences as input
provides significant benefits to existing skeleton-based self-supervised
methods. Meanwhile, building on the state-of-the-art Partial Spatio-Temporal
Learning (PSTL), we introduce an Occluded Partial Spatio-Temporal Learning
(OPSTL) framework. This enhancement utilizes Adaptive Spatial Masking (ASM) for
better use of high-quality, intact skeletons. The newly proposed method is
verified on challenging occluded versions of the NTURGB+D 60 and NTURGB+D
120 datasets. The source code is publicly available at https://github.com/cyfml/OPSTL.
comment: The source code is publicly available at
https://github.com/cyfml/OPSTL
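A minimal sketch of the KNN-imputation step: filling occluded joints from the average of the nearest sequences in embedding space. For brevity this searches over all samples, whereas the paper restricts neighbors to the same KMeans cluster; shapes follow an NTU-style 25-joint layout.

```python
import numpy as np

def knn_impute_skeletons(seq_embeddings, skeletons, missing_masks, k=3):
    """Fill occluded joints of each skeleton sequence from its k nearest
    neighbors in embedding space.
    seq_embeddings: (N, E) one embedding per sequence.
    skeletons:      (N, T, J, 3) joint positions.
    missing_masks:  (N, T, J) bool, True where a joint is occluded."""
    filled = skeletons.copy()
    d = np.linalg.norm(seq_embeddings[:, None] - seq_embeddings[None], axis=-1)
    np.fill_diagonal(d, np.inf)                 # never borrow from yourself
    for i in range(len(skeletons)):
        nbrs = np.argsort(d[i])[:k]
        donor = skeletons[nbrs].mean(axis=0)    # average neighbor pose
        filled[i][missing_masks[i]] = donor[missing_masks[i]]
    return filled

N, T, J = 8, 10, 25                                 # NTU-style 25 joints
emb = np.random.randn(N, 64)
skel = np.random.randn(N, T, J, 3)
mask = np.random.rand(N, T, J) < 0.2                # 20% occluded joints
print(knn_impute_skeletons(emb, skel, mask).shape)  # (8, 10, 25, 3)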
♻ ★ Interactive Distance Field Mapping and Planning to Enable Human-Robot Collaboration
Human-robot collaborative applications require scene representations that are
kept up-to-date and facilitate safe motions in dynamic scenes. In this letter,
we present an interactive distance field mapping and planning (IDMP) framework
that handles dynamic objects and collision avoidance through an efficient
representation. We define interactive mapping and planning as the process of
creating and updating the representation of the scene online while
simultaneously planning and adapting the robot's actions based on that
representation. The key aspect of this work is an efficient Gaussian Process
field that performs incremental updates and handles dynamic objects reliably by
identifying moving points via a simple and elegant formulation based on queries
from a temporary latent model. In terms of mapping, IDMP is able to fuse point
cloud data from single and multiple sensors, query the free space at any
spatial resolution, and deal with moving objects without semantics. In terms of
planning, IDMP allows seamless integration with gradient-based reactive
planners facilitating dynamic obstacle avoidance for safe human-robot
interactions. Our mapping performance is evaluated on both real and synthetic
datasets. A comparison with similar state-of-the-art frameworks shows superior
performance when handling dynamic objects and comparable or better performance
in the accuracy of the computed distance and gradient field. Finally, we show
how the framework can be used for fast motion planning in the presence of
moving objects in both simulated and real-world scenes. An accompanying video,
code, and datasets are made publicly available at https://uts-ri.github.io/IDMP.
♻ ★ Real-World Robot Applications of Foundation Models: A Review
Recent developments in foundation models, like Large Language Models (LLMs)
and Vision-Language Models (VLMs), trained on extensive data, facilitate
flexible application across different tasks and modalities. Their impact spans
various fields, including healthcare, education, and robotics. This paper
provides an overview of the practical application of foundation models in
real-world robotics, with a primary emphasis on the replacement of specific
components within existing robot systems. The summary encompasses the
perspective of input-output relationships in foundation models, as well as
their role in perception, motion planning, and control within the field of
robotics. This paper concludes with a discussion of future challenges and
implications for practical robot applications.
♻ ★ Log-GPIS-MOP: A Unified Representation for Mapping, Odometry and Planning
Whereas dedicated scene representations are required for each different task
in conventional robotic systems, this paper demonstrates that a unified
representation can be used directly for multiple key tasks. We propose the
Log-Gaussian Process Implicit Surface for Mapping, Odometry and Planning
(Log-GPIS-MOP): a probabilistic framework for surface reconstruction,
localisation and navigation based on a unified representation. Our framework
applies a logarithmic transformation to a Gaussian Process Implicit Surface
(GPIS) formulation to recover a global representation that accurately captures
the Euclidean distance field with gradients and, at the same time, the implicit
surface. By directly estimating the distance field and its gradient through
Log-GPIS inference, the proposed incremental odometry technique computes the
optimal alignment of an incoming frame and fuses it globally to produce a map.
Concurrently, an optimisation-based planner computes a safe collision-free path
using the same Log-GPIS surface representation. We validate the proposed
framework on simulated and real datasets in 2D and 3D and benchmark against the
state-of-the-art approaches. Our experiments show that Log-GPIS-MOP produces
competitive results in sequential odometry, surface mapping and obstacle
avoidance.
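The core of the logarithmic transformation can be stated compactly (notation ours, following the abstract): instead of regressing the distance d directly, regress the transformed field v(x) = exp(-λ d(x)), which equals 1 at surface measurements, with a GP, then recover distance and gradient from the posterior mean:

```latex
\hat d(\mathbf{x}) = -\frac{1}{\lambda}\,\log \hat v(\mathbf{x}),
\qquad
\nabla \hat d(\mathbf{x}) = -\frac{\nabla \hat v(\mathbf{x})}{\lambda\,\hat v(\mathbf{x})}.
```

Both quantities come from a single GP inference pass, which is what lets the same representation serve odometry (distance and gradient for frame alignment) and planning (collision cost) at once.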
♻ ★ The Dark Side of Rich Rewards: Understanding and Mitigating Noise in VLM Rewards
While Vision-Language Models (VLMs) are increasingly used to generate reward
signals for training embodied agents to follow instructions, our research
reveals that agents guided by VLM rewards often underperform compared to those
employing only intrinsic (exploration-driven) rewards, contradicting
expectations set by recent work. We hypothesize that false positive rewards --
instances where unintended trajectories are incorrectly rewarded -- are more
detrimental than false negatives. Our analysis confirms this hypothesis,
revealing that the widely used cosine similarity metric is prone to false
positive reward estimates. To address this, we introduce BiMI (Binary
Mutual Information), a novel reward function designed to mitigate noise.
BiMI significantly enhances learning efficiency across diverse and challenging
embodied navigation environments. Our findings offer a nuanced understanding of
how different types of reward noise impact agent learning and highlight the
importance of addressing multimodal reward signal noise when training embodied
agents.
comment: 10 main body pages, 11 appendix pages
♻ ★ OA-MPC: Occlusion-Aware MPC for Guaranteed Safe Robot Navigation with Unseen Dynamic Obstacles
For safe navigation in dynamic uncertain environments, robotic systems rely
on the perception and prediction of other agents. Particularly, in occluded
areas where cameras and LiDAR give no data, the robot must be able to reason
about potential movements of invisible dynamic agents. This work presents a
provably safe motion planning scheme for real-time navigation in an a priori
unmapped environment, where occluded dynamic agents are present. Safety
guarantees are provided based on reachability analysis. Forward reachable sets
associated with potential occluded agents, such as pedestrians, are computed
and incorporated into planning. An iterative optimization-based planner is
presented that alternates between two optimizations: nonlinear Model Predictive
Control (NMPC) and collision avoidance. Recursive feasibility of the MPC is
guaranteed by introducing a terminal stopping constraint. The effectiveness of
the proposed algorithm is demonstrated through simulation studies and hardware
experiments with a TurtleBot robot. A video of experimental results is
available at https://youtu.be/OUnkB5Feyuk.
♻ ★ OrionNav: Online Planning for Robot Autonomy with Context-Aware LLM and Open-Vocabulary Semantic Scene Graphs
Venkata Naren Devarakonda, Raktim Gautam Goswami, Ali Umut Kaypak, Naman Patel, Rooholla Khorrambakht, Prashanth Krishnamurthy, Farshad Khorrami
Enabling robots to autonomously navigate unknown, complex, dynamic
environments and perform diverse tasks remains a fundamental challenge in
developing robust autonomous physical agents. These agents must effectively
perceive their surroundings while leveraging world knowledge for
decision-making. Although recent approaches utilize vision-language and large
language models for scene understanding and planning, they often rely on
offline processing and offboard compute, and make simplifying assumptions about
the environment and perception, limiting real-world applicability. We present a
novel framework for real-time onboard autonomous navigation in unknown
environments that change over time by integrating multi-level abstraction in
both perception and planning pipelines. Our system fuses data from multiple
onboard sensors for localization and mapping and integrates it with
open-vocabulary semantics to generate hierarchical scene graphs from a
continuously updated semantic object map. The LLM-based planner uses these
graphs to create multi-step plans that guide low-level controllers in executing
navigation tasks specified in natural language. The system's real-time
operation enables the LLM to adjust its plans based on updates to the scene
graph and task execution status, ensuring continuous adaptation to new
situations or when the current plan cannot accomplish the task, a key advantage
over static or rule-based systems. We demonstrate our system's efficacy on a
quadruped navigating dynamic environments, showcasing its adaptability and
robustness in diverse scenarios.
♻ ★ Piecewise Stochastic Barrier Functions
This paper presents a novel stochastic barrier function (SBF) framework for
safety analysis of stochastic systems based on piecewise (PW) functions. We
first outline a general formulation of PW-SBFs. Then, we focus on PW-Constant
(PWC) SBFs and show how their simplicity yields computational advantages for
general stochastic systems. Specifically, we prove that synthesis of PWC-SBFs
reduces to a minimax optimization problem. Then, we introduce three efficient
algorithms to solve this problem, each offering distinct advantages and
disadvantages. The first algorithm is based on dual linear programming (LP),
which provides an exact solution to the minimax optimization problem. The
second is a more scalable algorithm based on iterative counter-example guided
synthesis, which involves solving two smaller LPs. The third algorithm solves
the minimax problem using gradient descent, which admits even better
scalability. We provide an extensive evaluation of these methods on various
case studies, including neural network dynamic models, nonlinear switched
systems, and high-dimensional linear systems. Our benchmarks demonstrate that
PWC-SBFs outperform state-of-the-art methods, namely sum-of-squares and neural
barrier functions, and can scale to eight-dimensional systems.
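To illustrate how such a synthesis problem reduces to a linear program, here is the standard epigraph reformulation of a minimax objective on a random instance; the actual PWC-SBF constraints (transition-probability bounds between state-space regions) are replaced by a random matrix.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_lp(A, b, bounds=(0.0, 1.0)):
    """Solve min_v max_i (A v - b)_i as an LP via the epigraph trick:
        min t  s.t.  A v - b <= t,  lo <= v <= hi.
    This mirrors, in miniature, how synthesizing piecewise-constant barrier
    values reduces a minimax problem to linear programming."""
    m, n = A.shape
    c = np.zeros(n + 1)
    c[-1] = 1.0                                   # minimize the epigraph var t
    A_ub = np.hstack([A, -np.ones((m, 1))])       # A v - t <= b
    lo, hi = bounds
    var_bounds = [(lo, hi)] * n + [(None, None)]  # barrier values in [0, 1]
    res = linprog(c, A_ub=A_ub, b_ub=b, bounds=var_bounds)
    return res.x[:n], res.x[-1]

rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, (20, 6))                   # random stand-in constraints
b = rng.uniform(0, 1, 20)
v, t = minimax_lp(A, b)
print("barrier values:", np.round(v, 3), "worst-case margin:", round(t, 3))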